Goto

Collaborating Authors

 gradient descent training


Explicit loss asymptotics in the gradient descent training of neural networks

Neural Information Processing Systems

Current theoretical results on optimization trajectories of neural networks trained by gradient descent typically have the form of rigorous but potentially loose bounds on the loss values. In the present work we take a different approach and show that the learning trajectory of a wide network in a lazy training regime can be characterized by an explicit asymptotic at large training times. Specifically, the leading term in the asymptotic expansion of the loss behaves as a power law $L(t) \sim C t^{-\xi}$ with exponent $\xi$ expressed only through the data dimension, the smoothness of the activation function, and the class of function being approximated. Our results are based on spectral analysis of the integral operator representing the linearized evolution of a large network trained on the expected loss. Importantly, the techniques we employ do not require a specific form of the data distribution, for example Gaussian, thus making our findings sufficiently universal.


Explicit loss asymptotics in the gradient descent training of neural networks

Neural Information Processing Systems

Current theoretical results on optimization trajectories of neural networks trained by gradient descent typically have the form of rigorous but potentially loose bounds on the loss values. In the present work we take a different approach and show that the learning trajectory of a wide network in a lazy training regime can be characterized by an explicit asymptotic at large training times. Specifically, the leading term in the asymptotic expansion of the loss behaves as a power law L(t) \sim C t {-\xi} with exponent \xi expressed only through the data dimension, the smoothness of the activation function, and the class of function being approximated. Our results are based on spectral analysis of the integral operator representing the linearized evolution of a large network trained on the expected loss. Importantly, the techniques we employ do not require a specific form of the data distribution, for example Gaussian, thus making our findings sufficiently universal.


On gradient descent training under data augmentation with on-line noisy copies

arXiv.org Machine Learning

In machine learning, data augmentation (DA) is a technique for improving the generalization performance. In this paper, we mainly considered gradient descent of linear regression under DA using noisy copies of datasets, in which noise is injected into inputs. We analyzed the situation where random noisy copies are newly generated and used at each epoch; i.e., the case of using on-line noisy copies. Therefore, it is viewed as an analysis on a method using noise injection into training process by DA manner; i.e., on-line version of DA. We derived the averaged behavior of training process under three situations which are the full-batch training under the sum of squared errors, the full-batch and mini-batch training under the mean squared error. We showed that, in all cases, training for DA with on-line copies is approximately equivalent to a ridge regularization whose regularization parameter corresponds to the variance of injected noise. On the other hand, we showed that the learning rate is multiplied by the number of noisy copies plus one in full-batch under the sum of squared errors and the mini-batch under the mean squared error; i.e., DA with on-line copies yields apparent acceleration of training. The apparent acceleration and regularization effect come from the original part and noise in a copy data respectively. These results are confirmed in a numerical experiment. In the numerical experiment, we found that our result can be approximately applied to usual off-line DA in under-parameterization scenario and can not in over-parametrization scenario. Moreover, we experimentally investigated the training process of neural networks under DA with off-line noisy copies and found that our analysis on linear regression is possible to be applied to neural networks.


Universal scaling laws in the gradient descent training of neural networks

arXiv.org Machine Learning

Current theoretical results on optimization trajectories of neural networks trained by gradient descent typically have the form of rigorous but potentially loose bounds on the loss values. In the present work we take a different approach and show that the learning trajectory can be characterized by an explicit asymptotic at large training times. Specifically, the leading term in the asymptotic expansion of the loss behaves as a power law $L(t) \sim t^{-\xi}$ with exponent $\xi$ expressed only through the data dimension, the smoothness of the activation function, and the class of function being approximated. Our results are based on spectral analysis of the integral operator representing the linearized evolution of a large network trained on the expected loss. Importantly, the techniques we employ do not require specific form of a data distribution, for example Gaussian, thus making our findings sufficiently universal.


Universality of Gradient Descent Neural Network Training

arXiv.org Machine Learning

It has been observed that design choices of neural networks are often crucial for their successful optimization. In this article, we therefore discuss the question if it is always possible to redesign a neural network so that it trains well with gradient descent. This yields the following universality result: If, for a given network, there is any algorithm that can find good network weights for a classification task, then there exists an extension of this network that reproduces these weights and the corresponding forward output by mere gradient descent training. The construction is not intended for practical computations, but it provides some orientation on the possibilities of meta-learning and related approaches.


On the Convergence of Gradient Descent Training for Two-layer ReLU-networks in the Mean Field Regime

arXiv.org Machine Learning

We describe a necessary and sufficient condition for the convergence to minimum Bayes risk when training two-layer ReLU-networks by gradient descent in the mean field regime with omni-directional initial parameter distribution. This article extends recent results of Chizat and Bach to ReLU-activated networks and to the situation in which there are no parameters which exactly achieve MBR. The condition does not depend on the initalization of parameters and concerns only the weak convergence of the realization of the neural network, not its parameter distribution.